1.  Abstract

This paper describes an algorithm for compressing the spectral representation of an utterance
along the time axis while keeping the main features intact. The goal of the algorithm is to save
template storage space and to reduce the time required for recognition. For 8 speakers, 5 data sets
each, the results indicate that we can save about 40% of the template space and 35% of the
recognition time with only a slightly higher error rate.

2.  Introduction 

In speech recognition, dynamic programming is commonly used to time-align the test utterance and
reference utterance frame by frame.[1][2]

In checking the feature parameters (spectral data in our case) in one utterance, we often find
contiguous frames which have almost the same feature parameters. Such frames are similar to
each other and need not all be matched one by one. We can keep just one frame and delete the others.
This process is called "frame compression" because several frames are compressed into one frame.

It is obvious that frame compression saves space and time for warping. It is also possible for
frame compression to keep the main features of the utterance when it is done appropriately.

From this idea, we developed an algorithm for frame compression and tested it on a large speech
data set. The results indicate the feasibility of this approach.


3.  The  Algorithm 

Fig. 3-1 shows the flow chart of the compression procedure. The input is the uncompressed
spectral data (15 coefficients for every frame and 4 bits per coefficient). Let us assume that an
utterance has N frames labeled 0 to N-1. The compression process consists of calculating Euclidean
distances between a frame and a number of its neighbors, then marking the frame for either retention
or deletion. In the figure, d[-1] is the distance between frame i and frame i-1, and d[+j] is the
distance between frame i and frame i+j (j = 1,2,3). s[i] is the mark which indicates whether frame i
should be deleted (with a "-" mark) or not. The output then consists of the frames with a "+" mark
only. The decision of the mark depends on the distances compared with a threshold T. (At present a
value of T = 25 is used.) Table 3-1 shows a typical segmentation trace of the utterance "B".

Table 3-1: As an illustration of the operation of the algorithm
as shown in the flow chart, this table gives the coefficients, distances
and mark for every frame. In the table, c0 to c14 are the 15 coefficients;
d[-1], d[+1], d[+2] and d[+3] are the four distances; fn is the frame
number; and s is the mark which indicates whether the corresponding frame
should be deleted ("-") or not ("+").
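
The distance computation described above can be sketched in a few lines of Python. The text does not reproduce the decision logic of the flow chart, so the marking rule below is only a stand-in chosen for illustration: a frame is deleted when its Euclidean distance to the most recently retained frame is below T. The function names, the stand-in rule, and the reading of T = 25 as an unsquared Euclidean distance are all assumptions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two frames (lists of coefficients)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def neighbor_distances(frames, i):
    """Return {j: d[j]} for j in (-1, +1, +2, +3), skipping neighbors
    that fall outside the utterance."""
    out = {}
    for j in (-1, 1, 2, 3):
        k = i + j
        if 0 <= k < len(frames):
            out[j] = euclidean(frames[i], frames[k])
    return out

def compress(frames, threshold=25.0):
    """Stand-in marking rule (an assumption, not the flow chart's logic):
    delete a frame when it lies within `threshold` of the last kept frame."""
    marks = []
    last_kept = None
    for frame in frames:
        if last_kept is not None and euclidean(frame, last_kept) < threshold:
            marks.append("-")   # similar to the retained frame: delete
        else:
            marks.append("+")   # sufficiently different: retain
            last_kept = frame
    kept = [f for f, m in zip(frames, marks) if m == "+"]
    return marks, kept

# Four toy frames of 15 coefficients each (made-up values).
toy = [[0] * 15, [1] * 15, [10] * 15, [10] * 15]
marks, kept = compress(toy)
print(marks, len(kept))
```

With these toy frames the second and fourth frames are marked for deletion, so half of the frames remain for storage and warping.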

4.  Results  of  Experiment  and  Discussion

The experiment was performed on a VAX-11/780 computer using the Cicada2 system as described
elsewhere.[3][4] The experiment was done using the data of 4 male speakers and 4 female
speakers. For every speaker, five data sets were used as test sets and one data set was used as the
reference set. Each set consists of 36 utterances (10 digits and the 26 letters of the alphabet). All
utterances have automatically determined endpoints. Table 4-1 gives the recognition results.


From Table 4-1 we can see that the accuracy of using compressed data is somewhat inferior to that
of using noncompressed data. The overall error rate (in percent) is calculated by sum/total number
of test utterances (= 1440).


Table 4-1: Comparison of compression vs noncompression

speaker    errors(com)    errors(noncom)
ds              26              21
fa              14               9
gg              23              24
jl              14              27
ma              33              22
ms              19              13
rp               4               6
sw              33              34
sum            166             166
%             11.6            10.6

Table 4-2 shows the percentage of the frames deleted from an utterance for four speakers. On the
average, about 40% of the frames were deleted. This indicates that we can save about 40% of the
space needed for template storage.

Table 4-2: Data Reduction in Percent

speaker    percent
ds           45.6
fa           33.4
gg           39.4
sw           43.6
average      40.6

Abstract

In both speaker-dependent and speaker-independent word recognition, the selection of the reference
templates is recognized as a crucial step with regard to the final accuracy of the system. Presented
here, for a speaker-dependent system, is an algorithm which chooses a reference template for each
word in the vocabulary from a set of N exemplars. The goal of the algorithm is to produce a reference
set that minimizes the worst matching behavior and total error over the N sets of exemplars. The
results of the experiments presented here show a reduction in the average error rate from 16.4% to
10.2% over a set of 4 male talkers and 4 female talkers.

1.  Introduction 

An important problem in isolated word recognition is the creation and/or selection of the reference
templates. Techniques for clustering of templates [3] [4] have been developed which yield multiple
reference patterns in speaker-independent systems. Our experiments indicate that the selection of
the reference templates in the speaker-dependent case has a significant effect on the recognition
accuracy obtained. The technique presented in this paper selects a single optimal template for each
vocabulary item based on the internal consistency of matches in an initial training set. The
recognition results we obtained with our template selection algorithm are superior in all cases to
those obtained when no template selection is done.


2.  Word  Recognition  System 

Figure 2-1 shows a flow diagram of the system [1] used in these experiments. The speech data
used in the experiments consists of 10 repetitions of the alphabet and digits (36 utterances) by 8
talkers (4 male, 4 female). Each talker completed two repetitions a day over a period of five days.
Each repetition was spoken in a different randomized order. The recording was done in an office
environment using a noise canceling microphone and a high quality tape recorder. The recorded
speech was then low pass filtered at 4.5 kHz and digitized at 10 kHz.

2.1.  Signal  Processing 

The raw digitized samples are taken as the input to a 256-point discrete Fourier analysis, using a
20.0 msec window stepped at 10.0 msec intervals. The results of the Fourier analysis are then
reduced to 16 coefficients by summing adjacent values in the spectrum according to the mel scale
(see Table 2-1). These 16 coefficients are then converted to log dB. Begin-end analysis proceeds on
the log dB signal by computing, for each frame,^2 the average energy and the difference between high
and low frequency energy content; these two parameters are then used in the begin-end analysis.

^2 Frames are defined as a set of 16 coefficients that represent 20.0 msec of signal.

Figure 2-1: Flow Diagram of System (block labels: log dB coefficients; 15 4-bit spectral
derivative coefficients; reference templates; unknown spectral pattern; time warping;
recognition results)


Noise subtraction is accomplished by computing an average noise spectrum and subtracting it
from each frame of the signal. If the energy level of a coefficient is below the average energy per
coefficient in the noise spectrum after the noise spectrum is subtracted, then that coefficient is set
to that average energy level. Finally, the coefficients are reduced to a 4-bit magnitude by taking the
derivative with respect to frequency.
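
The processing chain of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the system's code: the band boundaries passed to `band_energies` are placeholders rather than the Table 2-1 values, the function names are invented, and the clamp to the signed range -8..7 is only one plausible reading of "4-bit magnitude".

```python
import math

def band_energies(power_spectrum, boundaries):
    """Sum adjacent DFT bins between consecutive boundaries (one sum per band)."""
    return [sum(power_spectrum[lo:hi]) for lo, hi in zip(boundaries, boundaries[1:])]

def to_log_db(energies, floor=1e-10):
    """Convert band energies to dB; `floor` avoids log of zero (assumption)."""
    return [10.0 * math.log10(max(e, floor)) for e in energies]

def subtract_noise(frame, noise):
    """Subtract the average noise spectrum; any coefficient that ends up below
    the average energy per coefficient of the noise is set to that average."""
    noise_avg = sum(noise) / len(noise)
    return [max(c - n, noise_avg) for c, n in zip(frame, noise)]

def spectral_derivative(frame, lo=-8, hi=7):
    """Difference of adjacent coefficients (16 in, 15 out), rounded and
    clamped to an assumed signed 4-bit range."""
    return [max(lo, min(hi, round(b - a))) for a, b in zip(frame, frame[1:])]

# Toy usage with made-up numbers and made-up band boundaries.
bands = band_energies([1, 2, 3, 4, 5, 6], [0, 2, 3, 6])
cleaned = subtract_noise([10, 4, 7], [2, 2, 5])
print(bands, cleaned, spectral_derivative([0, 3, 20, 5]))
```

Taking the difference along frequency is what turns 16 log dB coefficients into the 15 derivative coefficients shown in Figure 2-1.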


2.2.  Warping 

The dynamic programming method used is the Itakura warping technique [2]. Although
several other dynamic time warping algorithms have been proposed, the Itakura warping
appears to give the most consistent results over a variety of conditions. The metric used to measure
the difference between the test and reference is a Euclidean distance.
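
As an illustration of warping with a Euclidean frame metric, the sketch below aligns a test pattern against a reference by dynamic programming. It is a simplified relative of the Itakura technique, not the technique itself: each test frame is matched once and the reference index may stay, advance by one, or advance by two, but the Itakura rule that forbids two consecutive flat steps and the global slope limits are omitted. The function names are assumptions.

```python
import math

def frame_distance(a, b):
    """Euclidean distance between two frames."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dtw_distance(test, ref):
    """Accumulated distance of the best path from (0, 0) to the last frames.
    Allowed predecessors of (i, j): (i-1, j), (i-1, j-1), (i-1, j-2)."""
    inf = float("inf")
    n, m = len(test), len(ref)
    acc = [[inf] * m for _ in range(n)]
    acc[0][0] = frame_distance(test[0], ref[0])
    for i in range(1, n):
        for j in range(m):
            best = acc[i - 1][j]
            if j >= 1:
                best = min(best, acc[i - 1][j - 1])
            if j >= 2:
                best = min(best, acc[i - 1][j - 2])
            if best < inf:
                acc[i][j] = best + frame_distance(test[i], ref[j])
    return acc[n - 1][m - 1]  # inf if the end point cannot be reached

# A test pattern that lingers on its first frame still matches exactly.
print(dtw_distance([[0], [0], [1]], [[0], [1]]))
```

In a recognizer of this kind the unknown pattern would be warped against every reference template and the template with the smallest accumulated distance taken as the first choice.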

Table 2-1: Mel Scale Frequency Boundaries^1


3.  Reference  Template  Selection 

As previously stated, the goal of the template selection algorithm is to choose a reference template
set from the training set that will provide the best match to the training set. For the purposes of our
discussion the first 5 repetitions of each speaker in our data base will be designated as the training
data sets. The last 5 repetitions will be designated as the test data sets. Initially we are interested in
what the results of the recognition are if we do no template selection and simply allow each of the
training data sets to serve in turn as the reference templates for the test data sets. These results are
presented in Table 3-1.

As can be seen, the error rate varies a great deal, depending on which data set is used for the
reference templates. When template selection is done, we will take advantage of the variance in
pronunciation and build a composite set of reference templates that exhibits a matching behavior
better than any one of the original training sets.


3.1.  Selection  Algorithm 

The algorithm proceeds by addressing the problem of templates belonging to utterances that are
easily confused. The key point is that the differences between these templates are not always large
enough to discriminate them correctly when matched with an unknown utterance. By carefully
selecting templates from the training sets we can increase the difference between confusable
templates, thereby reducing the error rate otherwise obtained. In order to facilitate the discussion of
the algorithm we shall designate U[e,w] as the utterances in the training set, M1[r,t,w] as the first
choice matching behavior and M2[r,t,w] as the second choice matching behavior, where

    e = exemplar number,     e = 1,2,...,N
    w = word number,         w = 1,2,...,W
    r = reference exemplar,  r = 1,2,...,N
    t = test exemplar,       t = 1,2,...,N

^1 [...] in the filter, which is composed of the sum of the specified DFT samples.

              Male Speaker
Reference      M1        M2        M3        M4
    1         17.2%     22.8%     32.8%      7.2%
    2         11.1%     12.2%     23.3%      9.4%
    3          7.9%     12.8%     27.8%      5.0%
    4          7.9%      8.3%     22.8%     12.2%
    5          7.2%     11.7%     24.4%      5.0%
Average       10.2%     13.6%     26.2%      7.8%
Best           7.2%      8.3%     22.8%      5.0%

              Female Speaker
Reference      F1        F2        F3        F4
    1         10.0%     21.7%     15.0%     21.7%
    2         17.2%     23.3%     17.8%     22.2%
    3         15.7%     21.1%     16.1%     20.0%
    4         13.9%     19.4%     16.7%     16.1%
    5         12.2%     23.9%     17.8%     25.0%
Average       13.8%     21.9%     16.7%     21.0%
Best          10.0%     19.4%     15.0%     16.1%

Grand Average = 16.4%
Average of Best error rates = 13.1%

Table 3-1: Recognition results with no template selection.

3.1.1.  Candidate  Selection

Consider the first choice matching behavior for word w', which we denote as M1w'[r,t]. In figure 3-1
we see an example of the first choice matching behavior for the vocabulary item "f". The first choice
matching behavior for a particular word in a particular test dataset is defined as the score obtained
and the word recognized given a particular reference dataset. In our example we see, for instance,
that the "f" in test dataset 2 is indeed recognized as an "f" with a score of 53 when dataset 1 is used
as the reference.

For each reference dataset we observe that there will be a range of scores obtained over the N test
datasets. For each reference the worst score over the N test datasets is picked out and defined as the
worst matching behavior for that reference. Let WM1w'[r] = Max M1w'[r,t] denote the worst
matching behavior for each reference r over the t test exemplars for word w'. In figure 3-1 the worst
matching behavior for each of the N references is boxed (shown here in square brackets).

M1w'[r,t]                reference dataset
                    1         2         3         4         5
test        1       0        50/x      54/f      44/f      64/f
dataset     2      53/f       0       [57/f]    [63/f]    [69/f]
            3      55/f      53/f       0        46/f      53/f
            4      43/f      54/f      43/f       0        60/f
            5     [62/f]    [60/f]     52/f      52/f       0

WM1w'[r]           62        60        57        63        69

Figure 3-1: First Choice Matching Behavior of "f"


Once the vector WM1w'[r] containing the worst matching behavior for each of the references is
formed, we choose the candidate template for w' as U[r',w'] such that r' gives the minimum of
WM1w'[r] over the N references. That is, the reference dataset that has the minimum worst matching
behavior becomes our candidate dataset. In our example the candidate template for w' is in dataset 3.
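
The candidate selection step can be reproduced from the scores of figure 3-1 with a short Python sketch. The matrix layout (rows are test datasets, columns are reference datasets, recognized labels dropped) and the function names are choices made for this illustration only.

```python
# Scores of figure 3-1 for the word "f": SCORES[t][r] is the first choice
# score of test dataset t+1 against reference dataset r+1.
SCORES = [
    [0, 50, 54, 44, 64],
    [53, 0, 57, 63, 69],
    [55, 53, 0, 46, 53],
    [43, 54, 43, 0, 60],
    [62, 60, 52, 52, 0],
]

def worst_matching_behavior(scores):
    """For each reference (column), the worst (largest) score over all tests."""
    n_refs = len(scores[0])
    return [max(row[r] for row in scores) for r in range(n_refs)]

def candidate_reference(scores):
    """1-based number of the reference whose worst score is smallest."""
    wm = worst_matching_behavior(scores)
    return wm.index(min(wm)) + 1

print(worst_matching_behavior(SCORES))  # [62, 60, 57, 63, 69]
print(candidate_reference(SCORES))      # 3
```

This is a minimax choice: the selected reference is the one whose worst match over the training set is least bad.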

3.1.2.  Verification 

In order to verify that U[r',w'] is indeed the best candidate for w' we must establish that the
matching behavior, M1r'w'[t], over the t test exemplars does the following:

•  Provides a correctly recognized word.

•  Has a match distance that is less than any wrong first choice recognition.

•  Has a match distance that is less than all second choice recognitions in M2w'[r,t] over all
r ≠ r'.


Using figure 3-1 we can check the first two conditions. We observe that dataset 3 meets the first
condition since it provides a correct recognition of "f" for the other four datasets.

Checking the second condition we see that the "f" from dataset 1 is recognized as an "x" with a
score of 50 when dataset 2 is used as the reference. This fails to meet the second condition since the
recognition for "f" in our candidate dataset (3) has a score of 54. Since this is the case, choosing the
"f" from dataset 3 may possibly lead to inherent error in our selected dataset. This inherent error
would arise if the "x" from dataset 2 was chosen as part of our selected dataset. In that case an
incorrect recognition result for the "f" from dataset 1 would occur when the selected dataset was
used as the reference.

Using figure 3-2 we can check the final condition. We observe that the second choice matching
behavior for "f" from dataset 2 produces a score of 55 for an "s" from dataset 2, which is lower than
the score of 57 obtained with the candidate. This can lead to inherent error in the same way as
described for the second condition. Thus, the candidate template fails to meet the third condition.


M2w'[r,t]                reference dataset
                    1         2         3*        4         5
test        1      87/x      70/f      54/f      72/s      56/s
dataset     2      78/m      55/s      57/f      80/x      67/m
            3      62/m      46/s       0/f      79/x      60/m
            4      81/l      75/x      43/f      67/s      58/s
            5      86/f      99/s      52/f      80/s      87/x

* Column 3 shows the first choice matching behavior of dataset 3, M1w'[3,t].

Figure 3-2: Second Choice Matching Behavior of "f"
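
The three verification conditions can be checked mechanically on the figure data. The sketch below is one reading of the conditions, not a definitive implementation: a competing match counts against the candidate when it names a wrong word and its score is no larger than the candidate's score for the same test dataset. The dictionary layout, the function name `verify`, and that per-test comparison are assumptions. The label letters follow the text where it quotes them ("s" with a score of 55); the rest are read from the figure as well as it can be recovered.

```python
TARGET = "f"

# First choice results of the candidate (reference dataset 3), by test dataset.
# Test dataset 3 is the candidate itself and is left out.
candidate = {1: (54, "f"), 2: (57, "f"), 4: (43, "f"), 5: (52, "f")}

# first_other[r][t] and second_other[r][t]: (score, label) for the other references.
first_other = {
    1: {2: (53, "f"), 3: (55, "f"), 4: (43, "f"), 5: (62, "f")},
    2: {1: (50, "x"), 3: (53, "f"), 4: (54, "f"), 5: (60, "f")},
    4: {1: (44, "f"), 2: (63, "f"), 3: (46, "f"), 5: (52, "f")},
    5: {1: (64, "f"), 2: (69, "f"), 3: (53, "f"), 4: (60, "f")},
}
second_other = {
    1: {1: (87, "x"), 2: (78, "m"), 3: (62, "m"), 4: (81, "l"), 5: (86, "f")},
    2: {1: (70, "f"), 2: (55, "s"), 3: (46, "s"), 4: (75, "x"), 5: (99, "s")},
    4: {1: (72, "s"), 2: (80, "x"), 3: (79, "x"), 4: (67, "s"), 5: (80, "s")},
    5: {1: (56, "s"), 2: (67, "m"), 3: (60, "m"), 4: (58, "s"), 5: (87, "x")},
}

def verify(candidate, first_other, second_other, target):
    """Return (condition 1 holds, violations of condition 2, violations of 3).
    Each violation is (reference, test, score, wrong label)."""
    cond1 = all(label == target for _, label in candidate.values())

    def violations(table):
        found = []
        for r, row in table.items():
            for t, (score, label) in row.items():
                if label != target and t in candidate and score <= candidate[t][0]:
                    found.append((r, t, score, label))
        return sorted(found)

    return cond1, violations(first_other), violations(second_other)

print(verify(candidate, first_other, second_other, TARGET))
```

Under this reading the sketch reports the same two problems as the text: the "x" at 50 from reference dataset 2 against the candidate's 54, and the "s" at 55 from reference dataset 2 against the candidate's 57.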

In the event that all of these conditions are satisfied, U[r',w'] is a good template for w', meaning
that using it will not lead to inherent error when the selected dataset is used as the reference for our
training datasets. However, for a majority of utterances a good template is not available since the
discriminability between these utterances is too small. In order to minimize the inherent error, the
choice of a best template for w' is made with reference to the entire set of training templates.

This procedure consists of selecting p additional candidates for w'. These candidates are chosen
by increasing magnitude of WM1w'[r]. When one or more candidates have been selected for all W
words, the inherent error for all combinations of the p candidates is computed among those words
which did not have a good template. The combination of W templates that produces the least
inherent error is then used as the selected template set. A potential drawback of this procedure is
that p must be kept small since the number of combinations to compute grows exponentially with p.
The data reported in this paper are based on template selection using a p of 2.
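
The exhaustive search over candidate combinations can be sketched as below. The paper does not give a formula for the inherent error, so it is passed in as a function; the toy error table, the candidate lists, and all names here are hypothetical and serve only to make the sketch runnable.

```python
from itertools import product

def select_template_set(candidates, inherent_error):
    """Try every combination of one candidate per word and keep the one with
    the least inherent error. `candidates` maps word -> list of exemplar numbers."""
    words = sorted(candidates)
    best_choice, best_error = None, None
    for combo in product(*(candidates[w] for w in words)):
        choice = dict(zip(words, combo))
        error = inherent_error(choice)
        if best_error is None or error < best_error:
            best_choice, best_error = choice, error
    return best_choice, best_error

# Hypothetical example: three words without a good template, two candidates each.
toy_candidates = {"f": [3, 4], "s": [2, 1], "x": [2, 5]}

# Made-up table of template pairs that would cause one training error together.
BAD_PAIRS = {
    (("f", 3), ("x", 2)): 1,
    (("f", 3), ("s", 2)): 1,
    (("f", 4), ("s", 2)): 1,
}

def toy_inherent_error(choice):
    picked = set(choice.items())
    return sum(cost for (a, b), cost in BAD_PAIRS.items() if a in picked and b in picked)

best, err = select_template_set(toy_candidates, toy_inherent_error)
print(best, err)
```

The number of combinations is the product of the candidate counts over the words involved (2 x 2 x 2 = 8 in this toy case), which is why the candidate lists have to stay short.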




4.  Recognition  Results  and  Discussion 


            New           Error Rate                    Error Rate
Speaker     Error Rate    Average      %Improvement     Best          %Improvement
M1           5.6%         10.2%        45.0%             7.2%         22.2%
M2           7.8%         13.6%        42.6%             8.3%          6.0%
M3          18.3%         26.2%        30.1%            22.8%         19.7%
M4           1.1%          7.8%        85.8%             5.0%         78.0%
F1           7.2%         13.8%        47.8%            10.0%         28.0%
F2          16.7%         21.9%        23.7%            19.4%         13.9%
F3          10.6%         16.7%        36.5%            15.0%         29.3%
F4          14.4%         21.0%        31.4%            16.1%         10.5%
Average     10.2%         16.4%        42.8%            13.1%         25.9%

Table 4-1: Recognition Results using Template Selection


If we examine the results obtained (Table 4-1) when this algorithm for reference template selection
is used, we see an improvement over the best results obtained for each speaker in the case where no
template selection is done. The average expected improvement over the average expected
recognition results is given as 42.8%. However, this percentage might be expected to decrease with a
smaller number of exemplars in the training set. Likewise, a larger number of exemplars would
probably result in diminishing returns on recognition improvement. While this algorithm
has the intuitively attractive feature of using a real template as opposed to a synthetic one, this
feature will probably lead to poor results in the case of speaker-independent recognition.
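
The improvement columns of Table 4-1 appear to be relative reductions in error rate, and the 42.8% average appears to be the mean of the per-speaker improvements rather than the reduction between the two average error rates. The short sketch below checks this reading against the table; the function name is an assumption, and small differences from the table are expected because the rates used here are already rounded.

```python
def relative_improvement(old_rate, new_rate):
    """Percentage reduction of the error rate relative to the old rate."""
    return 100.0 * (old_rate - new_rate) / old_rate

# %Improvement over the average error rate, per speaker, from Table 4-1.
per_speaker = [45.0, 42.6, 30.1, 85.8, 47.8, 23.7, 36.5, 31.4]

mean_improvement = sum(per_speaker) / len(per_speaker)   # close to 42.8
from_averages = relative_improvement(16.4, 10.2)         # close to 37.8
m1_improvement = relative_improvement(10.2, 5.6)         # close to 45.0

print(round(mean_improvement, 1), round(from_averages, 1), round(m1_improvement, 1))
```

Read this way, the drop in the average error rate from 16.4% to 10.2% corresponds to a relative reduction of roughly 38%, while the 42.8% figure weights each speaker's improvement equally.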